Chapter 5

Constructing Learning Algorithms



To implement the SRM inductive principle in learning algorithms one has 
to minimize the risk in a given set of functions by controlling two factors: 
the value of the empirical risk and the value of the confidence interval. 

Developing such methods is the goal of the theory for constructing learn- 
ing algorithms. 

In this chapter we describe learning algorithms for pattern recognition 
and consider their generalizations for the regression estimation problem. 



5.1 WHY CAN LEARNING MACHINES GENERALIZE?

The generalization ability of learning machines is based on the factors de- 
scribed in the theory for controlling the generalization ability of learning 
processes. According to this theory, to guarantee a high level of generaliza- 
tion ability of the learning process one has to construct a structure 

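$$S_1 \subset S_2 \subset \cdots \subset S_n$$

on the set of loss functions $S = \{Q(z, \alpha),\ \alpha \in \Lambda\}$ and then choose both an
appropriate element $S_k$ of the structure and a function $Q(z, \alpha_\ell^k) \in S_k$ in
this element that minimizes the corresponding bounds, for example, bound
(4.1). The bound (4.1) can be rewritten in the simple form

$$R(\alpha_\ell^k) \le R_{\rm emp}(\alpha_\ell^k) + \Phi\left(\frac{\ell}{h_k}\right), \tag{5.1}$$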




where the first term is the empirical risk and the second term is the confi- 
dence interval. 

There are two constructive approaches to minimizing the right-hand side 
of inequality (5.1). 

In the first approach, during the design of the learning machine one 
determines a set of admissible functions with some VC dimension $h^*$. For
a given amount $\ell$ of training data, the value $h^*$ determines the confidence
interval $\Phi(\ell/h^*)$ for the machine. Choosing an appropriate element of the
structure is therefore a problem of designing the machine for a specific 
amount of data. 

During the learning process this machine minimizes the first term of the 
bound (5.1) (the number of errors on the training set). 

If for a given amount of training data one designs too complex a machine, 
the confidence interval $\Phi(\ell/h^*)$ will be large. In this case even if one could
minimize the empirical risk down to zero the number of errors on the test 
set could still be large. This phenomenon is called overfitting. 

To avoid overfitting (to get a small confidence interval) one has to con- 
struct machines with small VC dimension. On the other hand, if the set of 
functions has a small VC dimension, then it is difficult to approximate the 
training data (to get a small value for the first term in inequality (5.1)). 
To obtain a small approximation error and simultaneously keep a small 
confidence interval one has to choose the architecture of the machine to 
reflect a priori knowledge about the problem at hand. 

Thus, to solve the problem at hand by these types of machines, one first 
has to find the appropriate architecture of the learning machine (which is 
a result of the trade-off between overfitting and poor approximation) and
second, find in this machine the function that minimizes the number of 
errors on the training data. This approach to minimizing the right-hand 
side of inequality (5.1) can be described as follows: 

Keep the confidence interval fixed (by choosing an appropriate construc- 
tion of machine) and minimize the empirical risk. 

The second approach to the problem of minimizing the right-hand side 
of inequality (5.1) can be described as follows: 

Keep the value of the empirical risk fixed (say equal to zero) and minimize 
the confidence interval. 

Below we consider two different types of learning machines that imple- 
ment these two approaches: 

(i) Neural Networks (which implement the first approach), and 

(ii) Support Vector machines (which implement the second approach). 

Both types of learning machines are generalizations of the learning ma-
chines with a set of linear indicator functions constructed in the 1960s.

5.2 SIGMOID APPROXIMATION OF INDICATOR
FUNCTIONS

Consider the problem of minimizing the empirical risk on the set of linear 
indicator functions 

$$f(x, w) = \operatorname{sign}\{(w \cdot x)\}, \quad w \in R^n, \tag{5.2}$$

where $(w \cdot x)$ denotes an inner product between vectors $w$ and $x$. Let

$$(x_1, y_1), \ldots, (x_\ell, y_\ell)$$

be a training set, where $x_j$ is a vector, and $y_j \in \{1, -1\}$, $j = 1, \ldots, \ell$.

The goal is to find the vector of parameters $w_0$ (weights) which minimizes
the empirical risk functional

$$R_{\rm emp}(w) = \frac{1}{\ell} \sum_{j=1}^{\ell} \left(y_j - f(x_j, w)\right)^2. \tag{5.3}$$



If the training set is separable without error (i.e., the empirical risk can
become zero) then there exists a finite step procedure that allows us to find
such a vector $w_0$, for example the procedure that Rosenblatt proposed for
the perceptron (see the Introduction).

The problem arises when the training set cannot be separated without
errors. In this case the problem of separating the training data with the
smallest number of errors is NP-complete. Moreover, one cannot apply reg-
ular gradient-based procedures to find a local minimum of functional (5.3),
since for this functional the gradient is either equal to zero or undefined.

Therefore, the idea was proposed to approximate the indicator functions
(5.2) by so-called sigmoid functions (see Fig. 0.3)

$$f(x, w) = S\{(w \cdot x)\},$$

where $S(u)$ is a smooth monotonic function such that

$$S(-\infty) = -1, \quad S(+\infty) = 1,$$

for example,

$$S(u) = \tanh u = \frac{\exp(u) - \exp(-u)}{\exp(u) + \exp(-u)}.$$

For the set of sigmoid functions, the empirical risk functional

$$R_{\rm emp}(w) = \frac{1}{\ell} \sum_{j=1}^{\ell} \left(y_j - S\{(w \cdot x_j)\}\right)^2 \tag{5.4}$$

is smooth in $w$. It has a gradient

$$\operatorname{grad}_w R_{\rm emp}(w) = -\frac{2}{\ell} \sum_{j=1}^{\ell} \left[y_j - S\{(w \cdot x_j)\}\right] S'\{(w \cdot x_j)\}\, x_j,$$

and therefore it can be minimized using standard gradient-based methods,
for example, the gradient descent method:

$$w_{\rm new} = w_{\rm old} - \gamma(\cdot)\, \operatorname{grad} R_{\rm emp}(w_{\rm old}),$$

where $\gamma(\cdot) = \gamma(n) \ge 0$ is a value which depends on the iteration number
$n$. For convergence of the gradient descent method to local minima it is
sufficient that the values of the gradient are bounded and that the coefficients
$\gamma(n)$ satisfy the following conditions:

$$\sum_{n=1}^{\infty} \gamma(n) = \infty, \qquad \sum_{n=1}^{\infty} \gamma^2(n) < \infty.$$

Thus, the idea is to use the sigmoid approximation at the stage of esti- 
mating the coefficients, and use the threshold functions (with the obtained 
coefficients) for the last neuron at the stage of recognition. 
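
The procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the sigmoid is taken to be tanh, the step size is chosen as gamma(n) = 1/n (which satisfies the two conditions on the coefficients), and the function names and toy data are hypothetical.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def empirical_risk(w, xs, ys):
    """R_emp(w) = (1/l) * sum_j (y_j - tanh(w . x_j))^2, the smooth functional (5.4)."""
    return sum((y - math.tanh(dot(w, x))) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, xs, ys):
    """grad R_emp(w) = -(2/l) * sum_j [y_j - S(u_j)] * S'(u_j) * x_j with S = tanh."""
    g = [0.0] * len(w)
    for x, y in zip(xs, ys):
        s = math.tanh(dot(w, x))
        coeff = -2.0 * (y - s) * (1.0 - s * s) / len(xs)  # S'(u) = 1 - tanh(u)^2
        for k, xk in enumerate(x):
            g[k] += coeff * xk
    return g

def train(xs, ys, steps=500):
    """Gradient descent with step gamma(n) = 1/n (assumed schedule)."""
    w = [0.0] * len(xs[0])
    for n in range(1, steps + 1):
        step = 1.0 / n  # sum of steps diverges, sum of squared steps converges
        g = gradient(w, xs, ys)
        w = [wk - step * gk for wk, gk in zip(w, g)]
    return w

def classify(w, x):
    """Recognition stage: the sigmoid is replaced by the threshold function."""
    return 1 if dot(w, x) >= 0 else -1

# Toy, linearly separable data (illustrative values only).
xs = [(2.0, 1.0), (1.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
w = train(xs, ys)
```

The sigmoid is used only while estimating the weights; `classify` applies the hard threshold with the weights obtained.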



5.3 NEURAL NETWORKS

In this section we consider classical neural networks, which implement the 
first strategy: keep the confidence interval fixed and minimize the empirical 
risk. 

This idea is used to estimate the weights of all neurons of a multi-layer 
perceptron (Neural Network). Instead of linear indicator functions (single 
neurons) in the networks one considers a set of sigmoid functions. 

The method for calculating the gradient of the empirical risk for the sig-
moid approximation of neural networks, called the back-propagation method,
was proposed¹ in 1986 (Rumelhart, Hinton, and Williams, 1986), (LeCun,
1986).

Using this gradient, one can iteratively modify the coefficients (weights) 
of a neural net on the basis of standard gradient-based procedures. 



5.3.1 The Back-Propagation Method

To describe the back-propagation method we use the following notations 
(Fig. 5.1): 



1 See footnote 5 on page 12. 
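A neural network is a combination of several levels of sigmoid elements, where the outputs of
one layer form the input for the next layer.

The network contains $m + 1$ layers: the first layer $x(0)$ describes
the input vector $x = (x^1, \ldots, x^n)$. We denote the input vector by

$$x_i = (x_i^1(0), \ldots, x_i^n(0)), \quad i = 1, \ldots, \ell,$$

and the image of the input vector $x_i(0)$ on the $k$th layer by

$$x_i(k) = (x_i^1(k), \ldots, x_i^{n_k}(k)), \quad i = 1, \ldots, \ell,$$

where we denote by $n_k$ the dimensionality of the vectors $x_i(k)$, $i = 1, \ldots, \ell$
($n_k$, $k = 1, \ldots, m - 1$ can be any number, but $n_m = 1$).

Layer $k - 1$ is connected with the layer $k$ through the $(n_k \times n_{k-1})$ matrix $w(k)$:

$$x_i(k) = S\{w(k) x_i(k-1)\}, \quad k = 1, 2, \ldots, m, \quad i = 1, \ldots, \ell, \tag{5.5}$$

where $S\{w(k) x_i(k-1)\}$ defines the sigmoid function of the vector

$$u_i(k) = w(k) x_i(k-1) = (u_i^1(k), \ldots, u_i^{n_k}(k))$$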
FIGURE 5.2. The Optimal separating hyperplane is the one that separates the
data with maximal margin. 

5.4.2 The Structure of Canonical Hyperplanes 

Now let the separating hyperplanes be defined on the set of vectors

$$X^* = x_1, \ldots, x_r,$$

bounded by a sphere of radius $R$:

$$|x_i - a| \le R, \quad x_i \in X^*$$

($a$ is the center of the sphere). Consider a set of hyperplanes in canonical
form (with respect to these vectors) defined by the pairs $(w, b)$ satisfying
the condition

$$\min_{x_i \in X^*} |(w \cdot x_i) + b| = 1.$$

Note that the set of canonical separating hyperplanes coincides with the 
set of all separating hyperplanes. It only specifies the formalization of the 
parameters of hyperplanes. 

The idea of constructing a machine that fixes the empirical risk and 
minimizes the confidence interval is based on the existence of the following
bound on the VC dimension of canonical hyperplanes. 

Theorem 5.1. A subset of canonical hyperplanes

$$f(x, w, b) = \operatorname{sign}\{(w \cdot x) + b\},$$

defined on $X^*$ and satisfying the constraint

$$\|w\| \le A$$

has the VC dimension $h$ bounded by the inequality

$$h \le \min([R^2 A^2], n) + 1.$$
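
To get a feel for the bound of Theorem 5.1, the following short sketch evaluates its right-hand side. It assumes that the square brackets denote the integer part; the function name `vc_bound` and the numeric values are illustrative only.

```python
import math

def vc_bound(R, A, n):
    """Right-hand side of Theorem 5.1: min([R^2 A^2], n) + 1, with [.] read as the integer part."""
    return min(math.floor(R * R * A * A), n) + 1

print(vc_bound(1.0, 3.0, 256))    # 10: the norm constraint dominates
print(vc_bound(1.0, 100.0, 256))  # 257: the dimensionality n + 1 dominates
```

When the product of R and A is small, the bound can be far below n + 1, which is the point exploited in the rest of the chapter.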
In Section 3.5 we stated that the VC dimension of the set of hyperplanes
is equal to $n + 1$, where $n$ is the dimensionality of the space. However, the
VC dimension of the subset of the set of hyperplanes, with canonical form
satisfying $|w|^2 \le A^2$, can be less.²

Below we consider hyperplanes only in canonical form, constructed on
the basis of the training vectors $X^* = x_1, \ldots, x_\ell$.³ For simplicity we call
them hyperplanes.

Let us construct the structure on the set of hyperplanes by increasing the
norm of the weights $w$. Then in order to obtain the smallest probability
of error on the test set, we choose the hyperplane that separates the
training data and belongs to the element of the structure with the
smallest bound on the VC dimension, that is, the hyperplane with the
smallest norm of weights.
5.5 CONSTRUCTING THE OPTIMAL HYPERPLANE

To construct the Optimal hyperplane one has to separate the vectors $x_i$ of
the training set

$$(y_1, x_1), \ldots, (y_\ell, x_\ell)$$

belonging to two different classes $y \in \{-1, 1\}$ using the hyperplane with
the smallest norm of coefficients.

To find this hyperplane one has to solve the following quadratic program-
ming problem: minimize the functional

$$\Phi(w) = \frac{1}{2}(w \cdot w) \tag{5.10}$$

under the constraints of inequality type

$$y_i[(x_i \cdot w) + b] \ge 1, \quad i = 1, 2, \ldots, \ell. \tag{5.11}$$

The solution to this optimization problem is given by the saddle point of
the Lagrange functional (Lagrangian):

$$L(w, b, \alpha) = \frac{1}{2}(w \cdot w) - \sum_{i=1}^{\ell} \alpha_i \{[(x_i \cdot w) + b] y_i - 1\}, \tag{5.12}$$

where the $\alpha_i$ are Lagrange multipliers. The Lagrangian has to be minimized
with respect to $w$, $b$ and maximized with respect to $\alpha_i \ge 0$.



2 In Section 5.7 we describe a separating hyperplane in 10 1 dimensional space 
with relatively small estimate of the VC dimension (wlO ). 
3 In Section 5.11 we will discuss this choice of the set $X^*$.
At the saddle point, the solutions $w_0$, $b_0$, and $\alpha^0$ should satisfy the
conditions

$$\frac{\partial L(w_0, b_0, \alpha^0)}{\partial b} = 0, \qquad \frac{\partial L(w_0, b_0, \alpha^0)}{\partial w} = 0.$$

Rewriting these equations in explicit form one obtains the following prop- 
erties of the Optimal hyperplane: 

(i) The coefficients $\alpha_i^0$ for the Optimal hyperplane should satisfy the
constraints

$$\sum_{i=1}^{\ell} \alpha_i^0 y_i = 0, \qquad \alpha_i^0 \ge 0, \quad i = 1, \ldots, \ell \tag{5.13}$$

(first equation).

(ii) The Optimal hyperplane (vector $w_0$) is a linear combination of the
vectors of the training set:

$$w_0 = \sum_{i=1}^{\ell} y_i \alpha_i^0 x_i, \qquad \alpha_i^0 \ge 0, \quad i = 1, \ldots, \ell \tag{5.14}$$

(second equation).

(iii) Moreover, only the so-called support vectors can have nonzero coeffi-
cients $\alpha_i^0$ in the expansion of $w_0$. The support vectors are the vectors
for which, in inequality (5.11), the equality is achieved. Therefore we
obtain

$$w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i, \qquad \alpha_i^0 \ge 0. \tag{5.15}$$

This fact follows from the classical Kuhn-Tucker theorem, accord-
ing to which the necessary and sufficient conditions for the Optimal
hyperplane are that the separating hyperplane satisfy the conditions

$$\alpha_i^0 \{[(x_i \cdot w_0) + b_0] y_i - 1\} = 0, \quad i = 1, \ldots, \ell. \tag{5.16}$$



Putting the expression for $w_0$ into the Lagrangian and taking into account
the Kuhn-Tucker conditions, one obtains the functional

$$W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j). \tag{5.17}$$

It remains to maximize this functional in the non-negative quadrant

$$\alpha_i \ge 0, \quad i = 1, \ldots, \ell \tag{5.18}$$

under the constraint

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0. \tag{5.19}$$

According to Eq. (5.15) the Lagrange multipliers and support vectors deter-
mine the Optimal hyperplane. Thus, to construct the Optimal hyperplane
one has to solve a simple quadratic programming problem: maximize the
quadratic form (5.17) under constraints⁴ (5.18) and (5.19).

Let $\alpha_0 = (\alpha_1^0, \ldots, \alpha_\ell^0)$ be a solution to this quadratic optimization prob-
lem. Then the norm of the vector $w_0$ corresponding to the Optimal hyper-
plane equals

$$|w_0|^2 = 2W(\alpha_0) = \sum_{\text{support vectors}} \alpha_i^0 \alpha_j^0 (x_i \cdot x_j) y_i y_j.$$
support vectors 

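The separating rule, based on the Optimal hyperplane, is the following
indicator function

$$f(x) = \operatorname{sign}\left(\sum_{\text{support vectors}} y_i \alpha_i^0 (x_i \cdot x) - b_0\right), \tag{5.20}$$

where $x_i$ are the support vectors, $\alpha_i^0$ are the corresponding Lagrange coef-
ficients, and $b_0$ is the constant (threshold)

$$b_0 = \frac{1}{2}\left[(w_0 \cdot x^*(1)) + (w_0 \cdot x^*(-1))\right],$$

where we denote by $x^*(1)$ some (any) support vector belonging to the first
class and we denote by $x^*(-1)$ a support vector belonging to the second
class (Vapnik and Chervonenkis, 1974), (Vapnik, 1979). Note that this
threshold enters (5.20) with a minus sign, so it has the opposite sign to
the offset $b$ in constraint (5.11).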
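
The properties of the Optimal hyperplane can be checked by hand on a two-point training set. The sketch below is an illustration under stated assumptions: the data and the multipliers (0.25 each) are chosen for the example and then verified against the conditions of this section, they are not produced by a quadratic programming solver, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Two training vectors, one per class (illustrative values).
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alpha = [0.25, 0.25]  # candidate multipliers; checked below, not derived

# (5.14): w0 is a linear combination of the training vectors.
w0 = [sum(a * y * x[k] for a, y, x in zip(alpha, ys, xs)) for k in range(2)]

# (5.17): W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
W = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
    for i in range(2) for j in range(2)
)

# Threshold computed from one support vector of each class.
b0 = 0.5 * (dot(w0, xs[0]) + dot(w0, xs[1]))

def f(x):
    """Decision rule (5.20): sign of (w0 . x) - b0."""
    return 1 if dot(w0, x) - b0 > 0 else -1

# y_i ((w0 . x_i) - b0) equals 1 exactly for support vectors.
margins = [y * (dot(w0, x) - b0) for x, y in zip(xs, ys)]

print(w0, W, dot(w0, w0))  # [0.5, 0.5] 0.25 0.5
```

Here both vectors are support vectors (both margins equal 1), the multipliers satisfy the equality constraint (5.13), and the squared norm of w0 equals 2W, as stated in the text.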

5.5.1 Generalization for the Nonseparable Case

To construct the Optimal type hyperplane in the case when the data are
linearly nonseparable, we introduce non-negative variables $\xi_i \ge 0$ and a
function

$$F_\sigma(\xi) = \sum_{i=1}^{\ell} \xi_i^\sigma$$

with parameter $\sigma > 0$.

4 This quadratic programming problem is simple because it has simple con-
straints. For the solution of this problem, one can use special methods which are
[...] (in our experiments 3% to 5%).
Let us minimize the functional $F_\sigma(\xi)$ subject to constraints

$$y_i((w \cdot x_i) + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, \ell \tag{5.21}$$

and one more constraint

$$(w \cdot w) \le c_n. \tag{5.22}$$

For sufficiently small $\sigma > 0$ the solution to this optimization problem
defines a hyperplane that minimizes the number of training errors under the
condition that the parameters of this hyperplane belong to the subset (5.22)
(to the element of the structure

$$S_n = \{(w \cdot x) + b : (w \cdot w) \le c_n\}$$

determined by the constant $c_n$).

For computational reasons, however, we consider the case $\sigma = 1$. This
case corresponds to the smallest $\sigma > 0$ that is still computationally simple.
We call this hyperplane the Generalized Optimal hyperplane.

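1. One can show (using the technique described above) that the Gener-
alized Optimal hyperplane is determined by the vector

$$w = \frac{1}{C^*} \sum_{i=1}^{\ell} \alpha_i y_i x_i,$$

where parameters $\alpha_i$, $i = 1, \ldots, \ell$, and $C^*$ are the solutions to the following
convex optimization problem:
Maximize the functional

$$W(\alpha, C^*) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2C^*} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \frac{c_n C^*}{2}$$

subject to constraints

$$\sum_{i=1}^{\ell} y_i \alpha_i = 0,$$
$$0 \le \alpha_i \le 1, \quad i = 1, \ldots, \ell,$$
$$C^* \ge 0.$$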

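2. To simplify computations one can introduce the following (slightly
modified) concept of the Generalized Optimal hyperplane (Cortes and Vap-
nik, 1995). The Generalized Optimal hyperplane is determined by the vec-
tor $w$ that minimizes the functional

$$\Phi(w, \xi) = \frac{1}{2}(w \cdot w) + C\left(\sum_{i=1}^{\ell} \xi_i\right)$$

(here $C$ is a given value) subject to constraint (5.21).

The technique of solution of this quadratic optimization problem is al-
most equivalent to the technique used in the separable case: to find the
coefficients of the generalized Optimal hyperplane

$$w = \sum_{i=1}^{\ell} \alpha_i y_i x_i,$$

one has to find the parameters $\alpha_i$, $i = 1, \ldots, \ell$, that maximize the same
quadratic form as in the separable case

$$W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

under slightly different constraints

$$0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell,$$
$$\sum_{i=1}^{\ell} \alpha_i y_i = 0.$$

As in the separable case, only some of the coefficients $\alpha_i$, $i = 1, \ldots, \ell$, differ
from zero. They determine the support vectors.

Note that if the coefficient $C$ in the functional $\Phi(w, \xi)$ is equal to the
optimal value of parameter $C^*$ for minimization of the functional $F_1(\xi)$,

$$C = C^*,$$

then the solutions to both optimization problems (defined by the functional
$F_1(\xi)$ and by the functional $\Phi(w, \xi)$) coincide.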
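
The role of the slack variables can be illustrated with a short sketch. It assumes the modified functional has the form one half of (w . w) plus C times the sum of the slacks, as in Cortes and Vapnik (1995), and it uses, for a fixed hyperplane, the smallest non-negative slacks compatible with constraint (5.21). The function names and data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, xs, ys):
    """Smallest xi_i >= 0 satisfying y_i((w . x_i) + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]

def phi(w, b, xs, ys, C):
    """Phi(w, xi) = 1/2 (w . w) + C * sum_i xi_i (assumed form of the functional)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, xs, ys))

xs = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (1.0, 0.0)]
ys = [1, 1, -1, -1]   # the last point lies on the wrong side of the hyperplane
w, b = (1.0, 0.0), 0.0

print(slacks(w, b, xs, ys))       # [0.0, 0.5, 0.0, 2.0]
print(phi(w, b, xs, ys, C=10.0))  # 25.5
```

A slack between 0 and 1 marks a point inside the margin that is still classified correctly; a slack above 1 marks a training error.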



5.6 SUPPORT VECTOR (SV) MACHINES 

The Support Vector (SV) machine implements the following idea: it maps
the input vectors $x$ into a high-dimensional feature space $Z$ through some
nonlinear mapping, chosen a priori. In this space, an Optimal separating
hyperplane is constructed (Fig. 5.3).
[Figure 5.3 labels: "Optimal hyperplane in the feature space"; "Input space"]



FIGURE 5.3. The SV machine maps the input space into a high-dimensional 
feature space and then constructs an Optimal hyperplane in the feature space. 

Example. To construct a decision surface corresponding to a polynomial
of degree two, one can create a feature space $Z$ which has $N = \frac{n(n+3)}{2}$
coordinates of the form

$$z^1 = x^1, \ldots, z^n = x^n \quad (n \text{ coordinates}),$$
$$z^{n+1} = (x^1)^2, \ldots, z^{2n} = (x^n)^2 \quad (n \text{ coordinates}),$$
$$z^{2n+1} = x^1 x^2, \ldots, z^N = x^n x^{n-1} \quad \left(\tfrac{n(n-1)}{2} \text{ coordinates}\right),$$

where $x = (x^1, \ldots, x^n)$. The separating hyperplane constructed in this
space is a second degree polynomial in the input space.
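
The feature space of this example can be built explicitly for small n. The sketch below is illustrative only (the function name is hypothetical); it constructs the linear, squared, and pairwise-product coordinates and confirms the count n(n+3)/2.

```python
from itertools import combinations

def quadratic_features(x):
    """Map x = (x^1, ..., x^n) to the n(n+3)/2 coordinates of the degree-two example."""
    linear = list(x)                                  # n coordinates
    squares = [xi * xi for xi in x]                   # n coordinates
    cross = [a * b for a, b in combinations(x, 2)]    # n(n-1)/2 coordinates
    return linear + squares + cross

x = (1.0, 2.0, 3.0)
z = quadratic_features(x)
print(z)       # [1.0, 2.0, 3.0, 1.0, 4.0, 9.0, 2.0, 3.0, 6.0]
print(len(z))  # 9, which equals 3 * (3 + 3) // 2
```

For a 256-dimensional input the same construction already gives 33,152 coordinates, which hints at why explicit feature maps become impractical for higher degrees.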

Two problems arise in the above approach: one conceptual and one tech- 
nical. 

(i) How to find a separating hyperplane that will generalize well? 
(The conceptual problem.) 

The dimensionality of the feature space will be huge, and a hyperplane
that separates the training data will not necessarily generalize well.⁵

(ii) How to treat computationally such high-dimensional spaces?
(The technical problem.)

To construct a polynomial of degree 4 or 5 in a 200 dimensional
space it is necessary to construct hyperplanes in a billion dimensional
feature space. How can this "curse of dimensionality" be overcome?



5 Recall Fisher's concern about the small amount of data for constructing a
quadratic discriminant function in classical discriminant analysis (Section 1.9).
5.6.1 Generalization in High-Dimensional Space

The conceptual part of this problem can be solved by constructing the
Optimal hyperplane.

According to Theorem 5.1, if it happens that in the high-dimensional
input space one can construct a separating hyperplane with a small value
of $[R^2 A^2]$, the VC dimension of the corresponding element of the structure
will be small, and therefore the generalization ability of the constructed
hyperplane will be high.

Furthermore, the following theorem holds.

Theorem 5.2. If the training vectors are separated by the Optimal hy-
perplane (or generalized Optimal hyperplane), then the expectation value of
the probability of committing an error on a test example is bounded by the
ratio of the expectation of the number of support vectors to the number of
examples in the training set:

$$E[P(\text{error})] \le \frac{E[\text{number of support vectors}]}{(\text{number of training vectors}) - 1}. \tag{5.23}$$

This bound depends neither on the dimensionality of the space, nor on
the norm of the vector of coefficients, nor on the bound of the norm of
the input vectors. Therefore, if the Optimal hyperplane can be constructed
from a small number of support vectors relative to the training set size,
the generalization ability will be high, even in an infinite-dimensional
space.⁶

5.6.2 Convolution of the Inner Product 

However, even if the Optimal hyperplane generalizes well and can theoret- 
ically be found, the technical problem of how to treat the high-dimensional 
feature space remains. 

In 1992 it was observed (Boser, Guyon, and Vapnik, 1992) that for con- 
structing the Optimal separating hyperplane in the feature space Z, one 



6 One can compare the result of this theorem to the result of analysis of the fol-
lowing compression scheme. To construct the Optimal separating hyperplane one
only needs to specify among the training data the support vectors and their classifi-
cation. This requires: $\approx \lceil \lg_2 m \rceil$ bits to specify the number $m$ of support vectors,
$\lceil \lg_2 C_\ell^m \rceil$ bits to specify the support vectors; and $\lceil \lg_2 C_m^{m_1} \rceil$ bits to specify rep-
resentatives of the first class among the support vectors. Therefore for $m \ll \ell$
and $m_1 \approx m/2$ the compression coefficient is

$$\frac{m\left(\lg_2 \frac{\ell}{m} + 1\right)}{\ell}.$$

The expectation of this coefficient should be compared to the value $E m/(\ell - 1)$
(the right-hand side of inequality (5.23)).
does not need to consider the feature space in explicit form. One only has
to be able to calculate the inner products between support vectors and the
vectors of the feature space (Eqs. (5.17) and (5.20)).

Consider a general expression for the inner product in Hilbert space⁷

$$(z_i \cdot z) = K(x, x_i),$$

where $z$ is the image in feature space of the vector $x$ in input space.
According to Hilbert-Schmidt theory, $K(x, x_i)$ can be any symmetric
function satisfying the following general conditions (Courant and Hilbert,
1953):

Theorem 5.3. (Mercer) To guarantee that the symmetric function $K(u, v)$
from $L_2$ has an expansion

$$K(u, v) = \sum_{k=1}^{\infty} a_k \psi_k(u) \psi_k(v)$$

with positive coefficients $a_k > 0$ (i.e., $K(u, v)$ describes an inner product in
some feature space), it is necessary and sufficient that the condition

$$\int\!\!\int K(u, v)\, g(u)\, g(v)\, du\, dv > 0$$

be valid for all $g \ne 0$ for which

$$\int g^2(u)\, du < \infty.$$
5.6.3 Constructing SV Machines

The convolution of the inner product allows the construction of decision
functions that are nonlinear in the input space

$$f(x) = \operatorname{sign}\left(\sum_{\text{support vectors}} y_i \alpha_i K(x_i, x) - b\right), \tag{5.25}$$

and that are equivalent to linear decision functions in the high-dimensional
feature space $\psi_1(x), \ldots, \psi_N(x)$ ($K(x_i, x)$ is a convolution of the inner prod-
uct for this feature space).

7 This idea was used in 1964 by Aizerman, Braverman, and Rozonoer in their
analysis of the convergence properties of the method of Potential functions (Aiz-
erman, Braverman, and Rozonoer, 1964, 1970). It happened at the same time
(1965) when the method of the Optimal hyperplane was developed (Vapnik and
Chervonenkis, 1965). However, combining these two ideas, which led to the SV
machines, was only done in 1992.
To find the coefficients $\alpha_i$ in the separable case (analogously in the non-
separable case) it is sufficient to find the maximum of the functional

$$W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{5.26}$$

subject to the constraints

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0,$$
$$\alpha_i \ge 0, \quad i = 1, 2, \ldots, \ell. \tag{5.27}$$

This functional coincides with the functional for finding the Optimal
hyperplane, except for the form of the inner products: instead of inner
products $(x_i \cdot x_j)$ in Eq. (5.17), we now use the convolution of the inner
products $K(x_i, x_j)$.

The learning machines which construct decision functions of the type 
(5.25) are called Support Vector (SV) Machines. (With this name we stress 
the idea of expanding the solution on support vectors. In SV machines the 
complexity of the construction depends on the number of support vectors 
rather than on the dimensionality of the feature space.) The scheme of SV 
machines is shown in Fig. 5.4. 
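
A decision function of the type (5.25) is straightforward to evaluate once the support vectors, their labels, the coefficients, and the threshold are known. The sketch below is a minimal illustration: the kernel is passed in as an ordinary function, and the support vectors and coefficients are hypothetical values chosen for the example, not the output of an optimization.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sv_decision(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sign( sum_i y_i alpha_i K(x_i, x) - b ), summed over support vectors only."""
    s = sum(y * a * kernel(xi, x)
            for xi, y, a in zip(support_vectors, labels, alphas))
    return 1 if s - b > 0 else -1

def linear_kernel(u, v):
    return dot(u, v)

def quadratic_kernel(u, v):
    return (dot(u, v) + 1.0) ** 2

# Hypothetical support vectors and coefficients (illustrative values).
svs = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.25, 0.25]

print(sv_decision((2.0, 0.5), svs, labels, alphas, 0.0, linear_kernel))      # 1
print(sv_decision((-3.0, 1.0), svs, labels, alphas, 0.0, quadratic_kernel))  # -1
```

The cost of one evaluation grows with the number of support vectors, not with the dimensionality of the feature space, which matches the remark in the text.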



5.6.4 Examples of SV Machines 

Using different functions for convolution of the inner products $K(x, x_i)$, one
can construct learning machines with different types of nonlinear decision
surfaces in input space. Below, we consider three types of learning machines:

(i) Polynomial Learning Machines, 

(ii) Radial Basis Functions Machines, and 

(iii) Two Layer Neural Networks. 

For simplicity we consider here the regime where the training vectors are 
separated without error. 

Note that the support vector machines implement the SRM principle.
Indeed, let

$$\Psi(x) = (\psi_1(x), \ldots, \psi_N(x))$$

be a feature space and $w = (w_1, \ldots, w_N)$ be a vector of weights determining
a hyperplane in this space. Consider a structure on the set of hyperplanes
with elements $S_k$ containing the functions satisfying the conditions

$$R^2 |w|^2 \le k,$$
[Figure 5.4 labels, from top to bottom: decision rule $y = \operatorname{sign}\left(\sum_{i=1}^{N} y_i \alpha_i K(x_i, x) - b\right)$; weights $y_1 \alpha_1, \ldots, y_N \alpha_N$; nonlinear transformation based on support vectors $x_1, \ldots, x_N$; input vector $x = (x^1, \ldots, x^n)$]



FIGURE 5.4. The two-layer SV machine is a compact realization of an Optimal 
hyperplane in the high-dimensional feature space Z. 

where $R$ is the radius of the smallest sphere that contains the vectors $\Psi(x)$,
$|w|$ is the norm of the weights (we use canonical hyperplanes in feature
space with respect to the vectors $z_i = \Psi(x_i)$, where $x_i$ are the elements of
the training data).

According to Theorem 5.1 (now applied in the feature space), $k$ gives an
estimate of the VC dimension of the set of functions $S_k$.

The SV machine separates without error the training data

$$y_i[(\Psi(x_i) \cdot w) + b] \ge 1, \quad y_i = \{+1, -1\}, \quad i = 1, 2, \ldots, \ell,$$

and has a minimal norm $|w|$.

In other words, the SV machine separates the training data using func-
tions from the element with the smallest estimate of the VC dimension.

Recall that in the feature space the equality

$$|w_0|^2 = \sum_{i,j} \alpha_i^0 \alpha_j^0 K(x_i, x_j) y_i y_j \tag{5.28}$$



holds true. To control the generalization ability of the machine (to min-
imize the probability of test errors) one has to construct the separating
hyperplane that minimizes the functional

$$\Phi(R, w_0, \ell) = \frac{R^2 |w_0|^2}{\ell}. \tag{5.29}$$

Indeed, for separating hyperplanes the probability of test errors with prob-
ability $1 - \eta$ is bounded by the expression

$$\mathcal{E} = 4\, \frac{h\left(\ln \frac{2\ell}{h} + 1\right) - \ln(\eta/4)}{\ell}.$$

The right-hand side of this expression attains its minimum when $h/\ell$ is minimal. We es-
timate the minimum of $h/\ell$ by estimating $h$ by $h_{\rm est} = R^2 |w_0|^2$. To estimate
this functional it is sufficient to estimate $|w_0|^2$ (say by expression (5.28))
and estimate $R^2$ by finding

$$R^2 = R^2(K) = \min_a \max_{x_i} \left[K(x_i, x_i) + K(a, a) - 2K(x_i, a)\right]. \tag{5.30}$$
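
The dependence of the bound on the ratio of h to the number of training examples can be explored numerically. The sketch below assumes that the bound for separating hyperplanes has the form 4 (h (ln(2l/h) + 1) - ln(eta/4)) / l and that h is estimated by the product of the squared radius and the squared norm of the weights; the function names and numbers are hypothetical.

```python
import math

def error_bound(h, l, eta):
    """Assumed form of the bound: 4 * (h * (ln(2l/h) + 1) - ln(eta/4)) / l."""
    return 4.0 * (h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l

def h_estimate(r_squared, w0_norm_squared):
    """h_est = R^2 |w0|^2, the estimate of the VC dimension used in the text."""
    return r_squared * w0_norm_squared

l, eta = 7300, 0.05
small = error_bound(h_estimate(1.0, 10.0), l, eta)
large = error_bound(h_estimate(1.0, 100.0), l, eta)
print(small < large)  # True: a smaller estimate of h gives a smaller bound
```

For h below 2l the bound grows with h, so minimizing the estimate of h for a fixed number of examples also minimizes the bound.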

Polynomial Learning Machine

To construct polynomial decision rules of degree $d$, one can use the fol-
lowing function for convolution of the inner product:

$$K(x, x_i) = [(x \cdot x_i) + 1]^d. \tag{5.31}$$

This symmetric function satisfies the conditions of Theorem 5.3; therefore
it describes a convolution of the inner product in the feature space that con-
tains all products $x_i \cdot x_j \cdot x_k \cdots$ up to degree $d$. Using the described technique,
one constructs a decision function of the form

$$f(x, \alpha) = \operatorname{sign}\left(\sum_{\text{support vectors}} y_i \alpha_i [(x_i \cdot x) + 1]^d - b\right),$$

which is a factorization of $d$-dimensional polynomials in $n$-dimensional input
space.
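
That a polynomial kernel really is an inner product in a feature space can be verified directly for a small case. The sketch below assumes the kernel has the standard form [(u . v) + 1]^d and checks it for d = 2 in two dimensions against an explicit feature map; the map with its square-root-of-two scaling is one possible choice, written out here for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, d=2):
    """K(u, v) = [(u . v) + 1]^d (assumed standard polynomial kernel)."""
    return (dot(u, v) + 1.0) ** d

def explicit_map(x):
    """Degree-2 feature map for 2-dimensional x whose inner product equals poly_kernel."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2)

u, v = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(u, v))                      # kernel evaluated in input space
print(dot(explicit_map(u), explicit_map(v)))  # same value, up to rounding
```

The kernel needs one inner product in the input space, whereas the explicit map needs all monomials up to degree d, whose number grows rapidly with n and d.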

In spite of the very high dimensionality of the feature space (polynomials
of degree $d$ in $n$-dimensional input space have $O(n^d)$ free parameters) the
estimate of the VC dimension of the subset of polynomials that solve real
life problems can be low.

As described above, to estimate the VC dimension of the element of the
structure from which the decision function is chosen, one has only to esti-
mate the radius $R$ of the smallest sphere that contains the training data,
and the norm of weights in feature space (Theorem 5.1).

Note that both the radius $R = R(d)$ and the norm of weights in the
feature space depend on the degree of the polynomial.

This gives the opportunity to choose the best degree of polynomial for 
the given data. 




To make a local polynomial approximation in the neighborhood of a point
of interest $x_0$, let us consider the hard threshold neighborhood function
(4.16). According to the theory of local algorithms, one chooses a ball with
radius $R_\beta$ around point $x_0$ in which $\ell_\beta$ elements of the training set fall,
and then using only these training data, constructs the decision function
that minimizes the probability of errors in the chosen neighborhood. The
solution to this problem is a radius $R_\beta$ that minimizes the functional

$$\Phi(R_\beta, w_0, \ell_\beta) = \frac{R_\beta^2 |w_0|^2}{\ell_\beta} \tag{5.32}$$

(the parameter $|w_0|$ depends on the chosen radius as well). This functional
describes a trade-off between the chosen radius $R_\beta$, the value of the mini-
mum of the norm $|w_0|$, and the number of training vectors $\ell_\beta$ that fall into
radius $R_\beta$.

Radial Basis Function Machines 

Classical Radial Basis Function (RBF) Machines use the following set of
decision rules:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} a_i K_\gamma(|x - x_i|) - b\right), \tag{5.33}$$

where $K_\gamma(|x - x_i|)$ depends on the distance $|x - x_i|$ between two vectors.
For the theory of RBF machines see (Micchelli, 1986), (Powell, 1992).

The function $K_\gamma(|x - x_i|)$ is, for any fixed $\gamma$, a non-negative monotonic
function; it tends to zero as $|x - x_i|$ goes to infinity. The most popular function
of this type is

$$K_\gamma(|x - x_i|) = \exp\{-\gamma |x - x_i|^2\}. \tag{5.34}$$

To construct the decision rule (5.33) one has to estimate

(i) the value of the parameter $\gamma$,

(ii) the number $N$ of the centers $x_i$,

(iii) the vectors $x_i$, describing the centers,

(iv) the value of the parameters $a_i$.

In the classical RBF method the first three steps (determining the param-
eters $\gamma$, $N$, and vectors (centers) $x_i$, $i = 1, \ldots, N$) are based on heuristics
and only the fourth step (after finding these parameters) is determined by
minimizing the empirical risk functional.
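
The decision rule (5.33) with the Gaussian function (5.34) can be written down directly. In the sketch below the centers, the coefficients, the threshold, and the width parameter are hypothetical values picked by hand, which mirrors the heuristic choices of the classical RBF method rather than any fitted model.

```python
import math

def rbf(x, center, gamma):
    """K_gamma(|x - x_i|) = exp(-gamma * |x - x_i|^2), the function (5.34)."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-gamma * dist2)

def rbf_decision(x, centers, coeffs, b, gamma):
    """f(x) = sign( sum_i a_i K_gamma(|x - x_i|) - b ), the rule (5.33)."""
    s = sum(a * rbf(x, c, gamma) for a, c in zip(coeffs, centers))
    return 1 if s - b > 0 else -1

centers = [(0.0, 0.0), (3.0, 3.0)]  # hypothetical centers x_i
coeffs = [1.0, -1.0]                # hypothetical coefficients a_i

print(rbf_decision((0.1, 0.1), centers, coeffs, 0.0, 1.0))  # 1
print(rbf_decision((2.9, 3.0), centers, coeffs, 0.0, 1.0))  # -1
```

Each point is assigned to the class of the nearer center here because the two coefficients have equal magnitude and the threshold is zero.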

The radial function can be chosen as a function for the convolution of the
inner product for a SV machine. In this case, the SV machine will construct
a function from the set (5.33). One can show (Aizerman, Braverman, and
Rozonoer, 1964, 1970) that radial functions (5.34) satisfy the condition of
Theorem 5.3.

In contrast to classical RBF methods, in the SV technique all four types
of parameters are chosen to minimize the bound on probability of test error
by controlling the parameters $R$, $w_0$ in the functional (5.29). By minimizing
the functional (5.29) one determines

(i) $N$, the number of support vectors,

(ii) $x_i$, (the pre-images of) support vectors;

(iii) $a_i = \alpha_i y_i$, the coefficients of expansion, and

(iv) $\gamma$, the width parameter of the kernel function.

Two-Layer Neural Networks

Finally, one can define two-layer neural networks by choosing kernels:

$$K(x, x_i) = S[v(x \cdot x_i) + c],$$

where $S(u)$ is a sigmoid function. In contrast to kernels for polynomial
machines or for radial basis function machines that always satisfy Mercer
conditions, the sigmoid kernel $\tanh(vu + c)$, $|u| \le 1$, satisfies Mercer condi-
tions only for some values of parameters $v$, $c$. For these values of parameters
one can construct SV machines implementing the rules:

$$f(x, \alpha) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i S(v(x \cdot x_i) + c) + b\right).$$

Using the technique described above, the following are found automatically:

(i) The architecture of the two layer machine, determining the number
$N$ of hidden units (the number of support vectors),

(ii) the vectors of the weights $w_i = x_i$ in the neurons of the first (hidden)
layer (the support vectors), and

(iii) the vector of weights for the second layer (values of $\alpha$).
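
A quick necessary check shows why the sigmoid kernel cannot satisfy the Mercer conditions for every choice of parameters: an inner product of a vector with itself is never negative, so a kernel with a negative value K(x, x) cannot describe an inner product. The sketch below uses tanh as the sigmoid; the parameter values are illustrative, and passing this check does not by itself prove that the Mercer conditions hold.

```python
import math

def sigmoid_kernel(u, v, scale, c):
    """K(u, v) = tanh(scale * (u . v) + c)."""
    return math.tanh(scale * sum(a * b for a, b in zip(u, v)) + c)

x = (0.5, 0.0)

# With c = -1 the "squared norm" K(x, x) = tanh(0.25 - 1) is negative,
# so these parameters cannot correspond to an inner product.
print(sigmoid_kernel(x, x, 1.0, -1.0) < 0)  # True

# With c = +1 this particular necessary condition is satisfied.
print(sigmoid_kernel(x, x, 1.0, 1.0) > 0)   # True
```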
5.7 EXPERIMENTS WITH SV MACHINES

In the following we will present two types of experiments constructing the
decision rules in the pattern recognition problem⁸:



8 The experiments were conducted in the Adaptive System Research Depart- 
ment, AT&T Bell Laboratories. 
FIGURE 5.5. Two classes of vectors are represented in the picture by black and
white balls. The decision boundaries were constructed using an inner product of
polynomial type with d = 2. In the pictures the examples cannot be separated
without errors; the errors are indicated by crosses and the support vectors by
double circles.

(i) Experiments in the plane with artificial data that can be visualized,
and

(ii) experiments with real-life data. 



5. 7. 1 Example in the Plane 

To demonstrate the SV technique we first give an artificial example (Fig.
5.5).

The two classes of vectors are represented in the picture by black and
white balls. The decision boundaries were constructed using an inner product
of polynomial type with d = 2. In the pictures the examples cannot
be separated without errors; the errors are indicated by crosses and the
support vectors by double circles.

Notice that in both examples the number of support vectors is small
relative to the number of training data and that the number of training
errors is minimal for polynomials of degree two.

5.7.2 Handwritten Digit Recognition 

Since the first experiments of Rosenblatt, the interest in the problem of 
learning to recognize handwritten digits has remained strong. In the fol- 
lowing we describe results of experiments on learning the recognition of 
handwritten digits using different SV machines. We also compare these re- 
sults to results obtained by other classifiers. In these experiments, the U.S. 
Postal Service database (LeCun et al., 1990) was used. It contains 7,300
training patterns and 2,000 test patterns collected from real-life zip codes.
The resolution of the database is 16 x 16 pixels; therefore the dimensionality
of the input space is 256. Figure 5.6 gives examples from this database.

Table 5.1 describes the performance of various classifiers solving this
problem. 9

Classifier                       Raw error %
Human performance                2.5
Decision tree, C4.5              16.2
Best two-layer neural network    5.9
Five-layer network (LeNet 1)     5.1

TABLE 5.1. Human performance and performance of the various learning machines
solving the problem of digit recognition on U.S. Postal Service data.

For constructing the decision rules three types of SV machines were 
used 10 : 

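(i) A polynomial machine with convolution function:

$$K(x, x_i) = \left( \frac{x \cdot x_i}{256} \right)^d, \quad d = 1, \ldots, 7.$$

(ii) A radial basis function machine with convolution function:

$$K(x, x_i) = \exp\left\{ -\frac{|x - x_i|^2}{256\,\sigma^2} \right\}.$$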

(iii) A two-layer neural network machine with convolution function:

$$K(x, x_i) = \tanh\left( \frac{b\,(x \cdot x_i)}{256} - c \right).$$

All machines constructed ten classifiers, each one separating one class from 
the rest. The ten class classification was done by choosing the class with 
the largest classifier output value. 
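The ten-class rule just described can be sketched as an argmax over the real-valued outputs of the ten one-versus-rest classifiers. The output values below are hypothetical and stand in for the values that ten trained SV machines would produce for one input pattern.

```python
def classify_one_vs_rest(outputs):
    """Return the index of the classifier with the largest real-valued output."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Hypothetical outputs of the ten classifiers (digit 0 ... digit 9) for one image.
outputs = [-1.2, -0.4, 0.3, -2.0, -0.9, -1.5, 0.1, -0.7, -1.1, -0.3]

predicted_digit = classify_one_vs_rest(outputs)
print(predicted_digit)
```

Note that the rule uses the values before the sign is taken, so a pattern is still assigned to a class when several classifiers (or none) give a positive output.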

The results of these experiments are given in Table 5.2. For different
types of SV machines, Table 5.2 shows the best parameters for the machines
(column 2), the average (over one classifier) of the number of support
vectors, and the performance of the machine.




9 The result of human performance was reported by J. Bromley and E.
Säckinger; the result of C4.5 was obtained by C. Cortes; the result for the
two-layer neural net was obtained by B. Schölkopf; the result for the special purpose
neural network architecture with five layers (LeNet 1) was obtained by Y. LeCun
et al.

10 The results were obtained by C. Burges, C. Cortes, and B. Schölkopf.

FIGURE 5.6. Examples of patterns (with labels) from the U.S. Postal Service
database.

Type of            Parameters           Number of          Raw
SV classifier      of classifier        support vectors    error
Polynomials        $d = 3$              274                4.0
RBF classifiers    $\sigma^2 = 0.3$     291                4.1
Neural network     $b = 2$, $c = 1$     254                4.2

TABLE 5.2. Results of digit recognition experiments with various SV machines
using the U.S. Postal Service database. The number of support vectors means
the average per classifier.





                          Poly    RBF     NN      Common
total # of sup. vect.     1677    1727    1611    1377
% of common sup. vect.    82      80      85      100

TABLE 5.3. Total number (in ten classifiers) of support vectors for various SV
machines and percentage of common support vectors.



Note that for this problem, all types of SV machines demonstrate ap- 
proximately the same performance. This performance is better than the 
performance of any other type of learning machine solving the digit recog- 
nition problem by constructing the entire decision rules on the basis of the 
U.S. Postal Service database. 11 



In these experiments one important singularity was observed: different 
types of SV machines use approximately the same set of support vectors. 
The percentage of common support vectors for three different classifiers 
exceeded 80%. 
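The percentages reported in Table 5.3 follow directly from its totals. The short check below divides the number of common support vectors (1377) by each machine's total, using only the figures given in the table.

```python
# Totals of support vectors (over ten classifiers) from Table 5.3.
totals = {"Poly": 1677, "RBF": 1727, "NN": 1611}
common = 1377  # support vectors shared by all three machines

percent_common = {name: round(100 * common / n) for name, n in totals.items()}
print(percent_common)
```

For the RBF machine the unrounded value is about 79.7%, which the table reports as 80.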

Table 5.3 describes the total number of different support vectors for ten
classifiers of different machines: Polynomial machine (Poly), Radial Basis
Function machine (RBF), and Neural Network machine (NN). It also shows
the number of common support vectors for all machines.

11 Note that using a local approximation approach described in Section 5.7 (that
does not construct the entire decision rule but approximates the decision rule at any
point of interest) one can obtain a better result: a 3.3% error rate (L. Bottou and
V. Vapnik, 1992).

The best result for this database, 2.7%, was obtained by P. Simard, Y. LeCun,
and J. Denker without using any learning methods. They suggested a special
method of elastic matching with 7200 templates using a smart concept of distance
(so-called Tangent distance) that takes into account invariance with respect to
small translations, rotations, distortions, and so on (P. Simard, Y. LeCun, and
J. Denker, 1993).

         Poly    RBF    NN
Poly     100     84     94
RBF      87      100    88
NN       91      82     100

TABLE 5.4. Percentage of common (total) support vectors for two SV machines.



Table 5.4 describes the percentage of support vectors of the classifier 
given in the columns contained in the support vectors of the classifier given 
in the rows. 



This fact, if it holds true for a wide class of real-life problems, is very
important.



5.7.3 Some Important Details 

In this subsection we give some important details on solving the digit
recognition problem using a polynomial SV machine.

The training data are not linearly separable. The total number of
misclassifications on the training set for linear rules is equal to 340 (about 5%
errors). For second degree polynomial classifiers the total number of
misclassifications on the training set is down to four. These four misclassified
examples (with desired labels) are shown in Fig. 5.7. Starting with polynomials
of degree three, the training data are separable.

Table 5.5 describes the results of experiments using decision polynomials 
(ten polynomials, one per classifier in one experiment) of various degrees. 
The number of support vectors shown in the table is a mean value per 
classifier. 




FIGURE 5.7. Labeled examples of training errors for the second degree polynomials.

degree of     dimensionality of      support    raw
polynomial    feature space          vectors    error
1             256                    282        8.9
2             ~ 33000                227        4.7
3             ~ 1 x 10^6             274        4.0
4             ~ 1 x 10^9             321        4.2
5             ~ 1 x 10^12            374        4.3
6             ~ 1 x 10^14            377        4.5
7             ~ 1 x 10^16            422        4.5

TABLE 5.5. Results of experiments with polynomials of different degrees.



Note that the number of support vectors increases slowly with the degree 
of the polynomials. The seventh degree polynomial has only 50% more 
support vectors than the third degree polynomial. 12 

The dimensionality of the feature space for a seventh degree polynomial
is, however, $10^{10}$ times larger than the dimensionality of the feature space
for a third degree polynomial classifier. Note that the performance does
not change significantly with increasing dimensionality of the space,
indicating no overfitting problems.

To choose the degree of the best polynomials for one specific classifier we
estimate the VC dimension (using the estimate $[R^2 A^2]$) for all constructed
polynomials (from degree two up to degree seven) and choose the one with
the smallest estimate of the VC dimension. In this way we found the ten
best classifiers (with different degrees of polynomials) for the ten two-class
problems. These estimates are shown in Fig. 5.8, where for all ten two-class
decision rules the estimated VC dimension is plotted versus the degree of
the polynomials. The question is:

Do the polynomials with the smallest estimate of the VC dimension provide
the best classifier?

To answer this question we constructed Table 5.6, which describes the
performance of the classifiers for each degree of polynomial.

Each row describes one two-class classifier separating one digit (stated
in the first column) from all the other digits.

The remaining columns contain:

deg.: the degree of the polynomial as chosen (from two up to seven)
by the described procedure,

dim.: the dimensionality of the corresponding feature space, which is
also the maximum possible VC dimension for linear classifiers in that
space,

h_est.: the VC dimension estimate for the chosen polynomial (which
is much smaller than the number of free parameters),

Number of test errors: the number of test errors, using the constructed
polynomial of corresponding degree; the boxes (shown here as square
brackets) show the number of errors for the chosen polynomial.

12 The relatively high number of support vectors for the linear separator is
due to nonseparability: the number 282 includes both support vectors and
misclassified data.

FIGURE 5.8. The estimate of the VC dimension of the best element of the structure
(defined on the set of canonical hyperplanes in the corresponding feature
space) versus the degree of polynomial for various two-class digit recognition
problems (denoted digit versus the rest).

         Chosen classifier            Number of test errors (by degree)
Digit    deg.   dim.      h_est.     1     2     3     4     5     6     7
0        3      ~10^6     530        36    14    [11]  11    11    12    17
1        7      ~10^16    101        17    15    14    11    10    10    [10]
2        3      ~10^6     842        53    32    [28]  26    28    27    32
3        3      ~10^6     1157       57    25    [22]  22    22    22    23
4        4      ~10^9     962        50    32    32    [30]  30    29    33
5        3      ~10^6     1090       37    20    [22]  24    24    26    28
6        4      ~10^9     626        23    12    12    [15]  17    17    19
7        5      ~10^12    530        25    15    12    10    [11]  13    14
8        4      ~10^9     1445       71    33    28    [?]   28    32    34
9        5      ~10^12    1226       51    18    15    11    [11]  12    15

TABLE 5.6. Experiments on choosing the best degree of polynomial.

Thus, Table 5.5 shows that for the SV polynomial machine there are no 
overfitting problems with increasing degree of polynomials, while Table 5.6 
shows that even in situations where the difference between the best and 
the worst solutions is small (for polynomials starting from degree two up 
to degree seven), the theory gives a method for approximating the best 
solutions (finding the best degree of the polynomial). 
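To make the comparison concrete, the sketch below takes two rows of Table 5.6 (digits 0 and 9) and compares the test errors of the degree chosen by the VC-dimension estimate with the best test error over all degrees and with the error of the linear classifier. Only numbers printed in the table are used; the dictionary layout and function name are illustrative choices.

```python
# Test errors for polynomial degrees 1..7 and the degree chosen by the
# VC-dimension estimate, copied from Table 5.6 for two of the ten classifiers.
rows = {
    0: {"chosen_degree": 3, "errors": [36, 14, 11, 11, 11, 12, 17]},
    9: {"chosen_degree": 5, "errors": [51, 18, 15, 11, 11, 12, 15]},
}

def summarize(row):
    """Return (errors of chosen degree, best errors, linear/best error ratio)."""
    errors = row["errors"]
    chosen = errors[row["chosen_degree"] - 1]  # degrees are 1-based
    best = min(errors)
    return chosen, best, errors[0] / best

summary = {digit: summarize(row) for digit, row in rows.items()}
for digit, (chosen, best, ratio) in summary.items():
    print(digit, chosen, best, round(ratio, 2))
```

For both of these rows the chosen degree attains the smallest number of test errors, while the linear classifier makes several times more errors.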

Note also that Table 5.6 demonstrates that the problem is essentially
nonlinear. The difference in the number of errors between the best polynomial
classifier and the linear classifier can be as much as a factor of four
(for digit 9).

5.8 REMARKS ON SV MACHINES

The quality of any learning machine is characterized by three main components:

(i) How universal is the learning machine?

How rich is the set of functions that it can approximate? 

(ii) How well can the machine generalize? 

How close is the upper bound on the error rate that this machine 
achieves (implementing a given set of functions and a given structure 
on this set of functions) to the smallest possible? 

(iii) How fast does the learning process for this machine converge? 
How many operations does it take to find the decision rule, using a 
given number of observations? 



We address these in turn below. 

(i) SV machines implement the sets of functions

$$f(x, \alpha, w) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i K(x, w_i) \right), \qquad (5.35)$$

where $N$ is any integer ($N < \ell$), $\alpha_i$, $i = 1, \ldots, N$ are any scalars, and
$w_i$, $i = 1, \ldots, N$ are any vectors. The kernel $K(x, w)$ can be any symmetric
function satisfying the conditions of Theorem 5.3.

As was demonstrated, the best guaranteed risk for these sets of functions
is achieved when the vectors of weights $w_1, \ldots, w_N$ are equal to some of the
vectors $x$ from the training data (support vectors).

Using the set of functions

$$f(x, \alpha, w) = \sum_{\text{support vectors}} y_i \alpha_i K(x, w_i) - b$$

with convolutions of polynomial, radial basis function, or neural network
type, one can approximate a continuous function to any degree of accuracy.

Note that for the SV machine one does not need to construct the archi- 
tecture of the machine by choosing a priori the number N (as is necessary 
in classical neural networks or in classical radial basis function machines). 

Furthermore, by changing only the function K (x, w) in the SV machine 
one can change the type of learning machine (the type of approximating 
functions). 

(ii) SV machines minimize the upper bound on the error rate for the
structure given on a set of functions in a feature space. For the best solution
it is necessary that the vectors $w_i$ in Eq. (5.35) coincide with some vectors
of the training data (support vectors 13). SV machines find the functions
from the set (5.35) that separate the training data and belong to the subset
with the smallest bound on the VC dimension. (In the more general case
they minimize the bound on the risk (5.1).)

13 This assertion is a direct corollary of the necessity of the Kuhn-Tucker
conditions for solving the quadratic optimization problem described in Section 5.4.
The Kuhn-Tucker conditions are necessary and sufficient for the solution of this
problem.

(iii) Finally, to find the desired function, the SV machine has to maximize
a non-positive quadratic form in the non-negative quadrant. This
problem is a particular case of a special quadratic programming problem:
to maximize a non-positive quadratic form $Q(x)$ with bounded constraints

$$a_i \le x^i \le b_i, \quad i = 1, \ldots, n,$$

where $x^i$, $i = 1, \ldots, n$ are the coordinates of the vector $x$ and $a_i$, $b_i$
are given constants. For this specific quadratic programming problem fast
algorithms exist.
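The text does not name the fast algorithms it has in mind. Purely as an illustration of the problem class (maximizing a non-positive quadratic form under bound constraints), the sketch below uses projected gradient ascent, which is a swapped-in, simple method and not necessarily one of the algorithms the author refers to. The quadratic form, the box, the step size, and the iteration count are hypothetical.

```python
def maximize_box_quadratic(grad, lower, upper, x0, step=0.1, iters=500):
    """Projected gradient ascent: take a gradient step, then clip each
    coordinate back into its interval [lower_i, upper_i]."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [min(max(xi + step * gi, lo), hi)
             for xi, gi, lo, hi in zip(x, g, lower, upper)]
    return x

# Hypothetical non-positive quadratic form Q(x) = -(x0 - 2)^2 - (x1 + 1)^2.
def grad_q(x):
    return [-2.0 * (x[0] - 2.0), -2.0 * (x[1] + 1.0)]

# Maximize Q over the box 0 <= x0 <= 1, 0 <= x1 <= 1.
solution = maximize_box_quadratic(grad_q, [0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
print(solution)
```

The unconstrained maximum (2, -1) lies outside the box, so the solution sits on the boundary, with each coordinate clipped to the nearest bound.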



5.9 SV MACHINES FOR THE REGRESSION ESTIMATION PROBLEM

5.9.1 ε-Insensitive Loss-Function

In Section 1.3.2 to describe the problem of approximation of the super- 
visor's rule F(y\x) for the case where y is real valued we considered a 
quadratic loss-function 



$$L(y, f(x, \alpha)) = (y - f(x, \alpha))^2. \qquad (5.36)$$



Using the ERM inductive principle and this loss-function one obtains a 
function that gives the best least squares approximation to the data. Un- 
der conditions where y is the result of measuring a function with normal 
additive noise (see Section 1.7.3) (and for the ERM principle) this loss- 
function provides also the best approximation to the regression. 

It is known, however, that if additive noise is generated by another law, 
the optimal approximation to the regression (for the ERM principle) gives 
another loss-function (associated with this law). 

In 1964 Huber developed a theory that allows us to define the best loss- 
function for the problem of regression estimation on the basis of the ERM 
principle if one has only general information about the model of the noise. In 
particular, he showed that if one only knows that the density p(x) describing 
the noise is a symmetric convex function possessing second derivatives, then 
the best minimax approximation to regression (the best approximation for
the worst possible $p(x)$) provides the loss-function 14

$$L(y, f(x, \alpha)) = |y - f(x, \alpha)|. \qquad (5.37)$$

Minimizing the empirical risk with respect to this loss-function is called the
least modulus method. It defines the so-called robust regression function.

We consider a slightly more general type of loss-function than (5.37), the
so-called linear loss-function with insensitive zone:

$$|y - f(x, \alpha)|_\varepsilon = \begin{cases} \varepsilon, & \text{if } |y - f(x, \alpha)| \le \varepsilon, \\ |y - f(x, \alpha)|, & \text{otherwise.} \end{cases} \qquad (5.38)$$

This loss-function describes the $\varepsilon$-insensitive model: the loss is equal to $\varepsilon$ if
the discrepancy between the predicted and actual value is less than $\varepsilon$ and is
equal to the discrepancy otherwise. The loss-function (5.37) is a particular
case of this loss-function for $\varepsilon = 0$.
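A minimal sketch of the loss as worded in this section (the loss equals the width of the zone inside the insensitive zone and the discrepancy outside it) is given below. The function name and the sample values are illustrative.

```python
def eps_insensitive_loss(y, prediction, eps):
    """Loss as described in this section: eps inside the insensitive zone,
    the absolute discrepancy |y - f| outside it."""
    discrepancy = abs(y - prediction)
    return eps if discrepancy <= eps else discrepancy

inside = eps_insensitive_loss(1.0, 1.05, 0.1)         # discrepancy 0.05, inside the zone
outside = eps_insensitive_loss(1.0, 2.0, 0.1)         # discrepancy 1.0, outside the zone
least_modulus = eps_insensitive_loss(1.0, 2.5, 0.0)   # eps = 0 gives the loss (5.37)
print(inside, outside, least_modulus)
```

With `eps = 0` the function reduces to the absolute value of the discrepancy, which is the least modulus loss.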

5.9.2 Minimizing the Risk Using Convex Optimization 
Procedure 

The support vector type approximation to regression takes place if:

(i) One estimates the regression in the set of linear functions

$$f(x, \alpha) = (w \cdot x) + b.$$

(ii) One defines the problem of regression estimation as the problem
of risk minimization with respect to an $\varepsilon$-insensitive ($\varepsilon \ge 0$)
loss-function (5.38).

(iii) One minimizes the risk using the SRM principle, where elements of
the structure $S_n$ are defined by the inequality

$$(w \cdot w) \le c_n. \qquad (5.39)$$
1. Indeed, suppose we are given training data

$$(x_1, y_1), \ldots, (x_\ell, y_\ell).$$

Then the problem of finding the $w_\ell$ and $b_\ell$ that minimize the empirical risk

$$R_{\mathrm{emp}}(w, b) = \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - (w \cdot x_i) - b|_\varepsilon$$

under constraint (5.39) is equivalent to the problem of finding the pair $w$, $b$
that minimizes the quantity defined by slack variables $\xi_i$, $\xi_i^*$, $i = 1, \ldots, \ell$

$$\sum_{i=1}^{\ell} \xi_i^* + \sum_{i=1}^{\ell} \xi_i \qquad (5.40)$$

under constraints

$$\begin{aligned}
y_i - (w \cdot x_i) - b &\le \varepsilon + \xi_i^*, & i = 1, \ldots, \ell, \\
(w \cdot x_i) + b - y_i &\le \varepsilon + \xi_i, & i = 1, \ldots, \ell, \\
\xi_i^* &\ge 0, & i = 1, \ldots, \ell, \\
\xi_i &\ge 0, & i = 1, \ldots, \ell,
\end{aligned} \qquad (5.41)$$

and constraint (5.39).

14 This is an extreme case where one has minimal information about an unknown
density. Huber described also the intermediate cases where the unknown
density is a mixture of some given density and any density from a described set
of densities, taken in proportion $\varepsilon$ and $1 - \varepsilon$ (Huber, 1964).

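As before, to solve the optimization problem with constraints of inequality
type one has to find a saddle point of the Lagrange functional

$$\begin{aligned}
L(w, b, \xi^*, \xi; \alpha^*, \alpha, C^*, \gamma, \gamma^*) = {} & \sum_{i=1}^{\ell} (\xi_i^* + \xi_i) - \sum_{i=1}^{\ell} \alpha_i \left[ y_i - (w \cdot x_i) - b + \varepsilon + \xi_i \right] \\
& - \sum_{i=1}^{\ell} \alpha_i^* \left[ (w \cdot x_i) + b - y_i + \varepsilon + \xi_i^* \right] - \frac{C^*}{2} \left( c_n - (w \cdot w) \right) \\
& - \sum_{i=1}^{\ell} (\gamma_i^* \xi_i^* + \gamma_i \xi_i).
\end{aligned} \qquad (5.42)$$

(Minimum with respect to elements $w$, $b$, $\xi_i^*$, and $\xi_i$ and maximum with
respect to Lagrange multipliers $C^* \ge 0$, $\alpha_i^* \ge 0$, $\alpha_i \ge 0$, $\gamma_i^* \ge 0$, and
$\gamma_i \ge 0$, $i = 1, \ldots, \ell$.)

Minimization with respect to $w$, $b$, and $\xi_i^*$, $\xi_i$ implies the following three
conditions:

$$w = \sum_{i=1}^{\ell} \frac{\alpha_i^* - \alpha_i}{C^*}\, x_i, \qquad (5.43)$$

$$\sum_{i=1}^{\ell} \alpha_i^* = \sum_{i=1}^{\ell} \alpha_i, \qquad (5.44)$$

$$0 \le \alpha_i^* \le 1, \quad 0 \le \alpha_i \le 1, \quad i = 1, \ldots, \ell. \qquad (5.45)$$

Putting (5.43) and (5.44) into (5.42) one obtains that, for the solution of this
optimization problem, one has to find the maximum of the convex functional

$$W(\alpha, \alpha^*, C^*) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} y_i (\alpha_i^* - \alpha_i) - \frac{1}{2C^*} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(x_i \cdot x_j) - \frac{c_n C^*}{2} \qquad (5.46)$$

subject to constraints (5.44), (5.45) and constraint

$$C^* \ge 0.$$

As in pattern recognition, here only some of the parameters in expansion
(5.43)

$$\beta_i = \frac{\alpha_i^* - \alpha_i}{C^*}, \quad i = 1, \ldots, \ell,$$

differ from zero. They define the support vectors of the problem.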

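2. One can reduce the convex optimization problem of finding the vector
$w$ to a quadratic optimization problem if, instead of minimizing the
functional (5.40) subject to constraints (5.41) and (5.39), one minimizes

$$\Phi(w, \xi^*, \xi) = \frac{1}{2}(w \cdot w) + C \left( \sum_{i=1}^{\ell} \xi_i^* + \sum_{i=1}^{\ell} \xi_i \right)$$

(with given value $C$) subject to constraints (5.41). In this case, to find the
desired vector

$$w = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i)\, x_i$$

one has to find coefficients $\alpha_i^*$, $\alpha_i$, $i = 1, \ldots, \ell$ that maximize the quadratic
form

$$W(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} y_i (\alpha_i^* - \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(x_i \cdot x_j) \qquad (5.47)$$

subject to constraints

$$\sum_{i=1}^{\ell} \alpha_i^* = \sum_{i=1}^{\ell} \alpha_i,$$

$$0 \le \alpha_i^* \le C, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell.$$

As in the pattern recognition case, the solutions to these two optimization
problems coincide if $C = C^*$.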
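As a hedged illustration of the quadratic optimization form, the sketch below evaluates the dual objective, taken here in the standard form W = -eps * sum(a*_i + a_i) + sum y_i (a*_i - a_i) - (1/2) sum_ij (a*_i - a_i)(a*_j - a_j)(x_i . x_j), and the weight vector w = sum_i (a*_i - a_i) x_i at one feasible point. The two training points and the coefficient values are hypothetical, and the point is merely feasible, not claimed to be the maximizer.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha_star, alpha, xs, ys, eps):
    """Evaluate the quadratic form W(alpha, alpha*) for the linear kernel."""
    d = [s - a for s, a in zip(alpha_star, alpha)]
    linear = -eps * (sum(alpha_star) + sum(alpha)) + dot(ys, d)
    quad = sum(d[i] * d[j] * dot(xs[i], xs[j])
               for i in range(len(xs)) for j in range(len(xs)))
    return linear - 0.5 * quad

def weight_vector(alpha_star, alpha, xs):
    """w = sum_i (alpha*_i - alpha_i) x_i."""
    dim = len(xs[0])
    return [sum((s - a) * x[k] for s, a, x in zip(alpha_star, alpha, xs))
            for k in range(dim)]

# Hypothetical one-dimensional data and a feasible (not optimized) point:
# sum(alpha_star) == sum(alpha), and all coefficients lie in [0, C] for C >= 0.5.
xs = [[0.0], [1.0]]
ys = [0.0, 1.0]
alpha_star = [0.0, 0.5]
alpha = [0.5, 0.0]

w = weight_vector(alpha_star, alpha, xs)
value = dual_objective(alpha_star, alpha, xs, ys, eps=0.0)
print(w, value)
```

A quadratic programming solver would search over all feasible coefficient pairs for the largest value of this objective; the evaluation above only shows how the pieces fit together.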

One can show that for any $i = 1, \ldots, \ell$ the equality

$$\alpha_i^* \alpha_i = 0$$

holds true. Therefore, for the particular case where $\varepsilon = 0$ and $y_i \in \{-1, 1\}$
the considered optimization problems coincide with those described for
pattern recognition in Section 5.5.1.

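To derive the bound on the generalization of the SV machine, suppose
that the distribution $F(x, y) = F(y|x)F(x)$ is such that for any fixed $w$, $b$ the
corresponding distribution of the random variable $|y - (w \cdot x) - b|_\varepsilon$ has a
"light tail" (see Section 3.4):

$$\sup_{w, b} \frac{\sqrt[p]{E\, |y - (w \cdot x) - b|_\varepsilon^p}}{E\, |y - (w \cdot x) - b|_\varepsilon} \le \tau, \quad p > 2.$$

Then according to equation (3.30) one can assert that the solution $w_\ell$, $b_\ell$
of the optimization problem provides a risk (with respect to loss function
(5.38)) such that with probability at least $1 - \eta$ the bound

$$R(w_\ell, b_\ell) \le \frac{R_{\mathrm{emp}}(w_\ell, b_\ell)}{\left( 1 - a(p)\, \tau \sqrt{\mathcal{E}} \right)_+}$$

holds true, where

$$a(p) = \sqrt[p]{\frac{1}{2} \left( \frac{p - 1}{p - 2} \right)^{p - 1}}$$

and

$$\mathcal{E} = 4\, \frac{h_n \left( \ln \frac{2\ell}{h_n} + 1 \right) - \ln(\eta / 4)}{\ell}.$$

Here $h_n$ is the VC dimension of the set of functions

$$S_n = \{ |y - (w \cdot x) - b|_\varepsilon : (w \cdot w) \le c_n \}.$$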

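5.9.3 SV Machine with Convolved Inner Product

Constructing the best approximation of the form

$$f(x; v, \beta) = \sum_{i=1}^{N} \beta_i K(x, v_i) + b, \qquad (5.48)$$

where $\beta_i$, $i = 1, \ldots, N$ are scalars, $v_i$, $i = 1, \ldots, N$ are vectors, and $K(\cdot, \cdot)$ is
a given function satisfying Mercer's conditions, is analogous to constructing
a linear approximation. It can be conducted both by solving a convex
optimization problem and by solving a quadratic optimization problem.

1. Using the convex optimization approach one evaluates coefficients
$\beta_i$, $i = 1, \ldots, \ell$ in (5.48) as

$$\beta_i = \frac{\alpha_i^* - \alpha_i}{C^*},$$

where $\alpha_i^*$, $\alpha_i$, $C^*$ are parameters that maximize the function

$$W = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} y_i (\alpha_i^* - \alpha_i) - \frac{1}{2C^*} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(x_i, x_j) - \frac{c_n C^*}{2}$$

subject to the constraint

$$\sum_{i=1}^{\ell} \alpha_i^* = \sum_{i=1}^{\ell} \alpha_i$$

and to constraints

$$0 \le \alpha_i^* \le 1, \quad 0 \le \alpha_i \le 1, \quad i = 1, \ldots, \ell,$$

and

$$C^* \ge 0.$$

2. Using the quadratic optimization approach one evaluates the vector $\beta$ in
(5.48) with coordinates

$$\beta_i = \alpha_i^* - \alpha_i, \quad i = 1, \ldots, \ell,$$

where $\alpha_i^*$, $\alpha_i$ are parameters that maximize the function

$$W = -\varepsilon \sum_{i=1}^{\ell} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{\ell} y_i (\alpha_i^* - \alpha_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(x_i, x_j)$$

subject to the constraint

$$\sum_{i=1}^{\ell} \alpha_i^* = \sum_{i=1}^{\ell} \alpha_i$$

and to constraints

$$0 \le \alpha_i^* \le C, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell.$$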

Choosing different kernels $K(\cdot, \cdot)$ satisfying Mercer's condition one constructs
different types of learning machines. In particular, the kernel

$$K(x, x_i) = [(x \cdot x_i) + 1]^d$$

gives a polynomial machine.
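Once the coefficients are known, evaluating the approximation is a single kernel expansion. The sketch below computes f(x) = sum_i beta_i K(x, x_i) + b with the polynomial kernel [(x . x_i) + 1]^d. The support vectors, coefficients, offset, and degree are hypothetical values, not the result of solving either optimization problem.

```python
def poly_kernel(x, x_i, d):
    """K(x, x_i) = [(x . x_i) + 1]^d."""
    return (sum(a * b for a, b in zip(x, x_i)) + 1.0) ** d

def sv_regression(x, support_vectors, betas, b, d):
    """f(x) = sum_i beta_i K(x, x_i) + b."""
    return b + sum(beta * poly_kernel(x, sv, d)
                   for beta, sv in zip(betas, support_vectors))

# Hypothetical expansion with two support vectors and a degree-two kernel.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
betas = [0.5, -0.25]

prediction = sv_regression([1.0, 2.0], support_vectors, betas, b=0.1, d=2)
print(prediction)
```

Swapping `poly_kernel` for another function satisfying Mercer's condition changes the type of machine without changing the rest of the expansion.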

By controlling two parameters $c_n$ and $\varepsilon$ ($C$ and $\varepsilon$ in the quadratic
optimization approach) one can control the generalization ability, even for high
degree polynomials in a high-dimensional space.

5.10 THE ART OF ENGINEERING VERSUS FORMAL
INFERENCE

The existence of neural networks can be considered a challenge for theo- 
reticians. 

From the formal point of view one cannot guarantee that neural networks 
generalize well, since according to theory, in order to control generalization 
ability one should control two factors: the value of the empirical risk and the 
value of the confidence interval. Neural networks, however, cannot control 
either of the two. 

Indeed, to minimize the empirical risk, a neural network must minimize a 
functional that has many local minima. Theory offers no constructive way 
to prevent ending up with unacceptable local minima. In order to control 
the confidence interval one has first to construct a structure on the set of 
functions that the neural network implements and then to control capacity 
using this structure. There are no accurate methods to do this for neural 
networks. 

Therefore from the formal point of view it seems that there should be 
no question as to what type of machine should be used for solving real-life 
problems. 

The reality, however, is not so straightforward. The designers of neural
networks compensate for the mathematical shortcomings with the high art
of engineering. Namely, they incorporate various heuristic algorithms that
make it possible to attain reasonable local minima using a reasonably small
number of calculations.

Moreover, for given problems they create special network architectures 
which both have an appropriate capacity and contain "useful" functions for 
solving the problem. Using these heuristics, neural networks demonstrate 
surprisingly good results. 

In Chapter 5, describing the best results for solving the digit recognition 
problem using the U.S. Postal Service database by constructing an entire 
(not local) decision rule we gave two figures: 

5.1% error rate for the neural network LeNet 1 (designed by Y. Le- 
Cun), 

4.0% error rate for a polynomial SV machine. 

We also mentioned the two best results: 

3.3% error rate for the local learning approach, and the record 

2.7% error rate for tangent distance matching to templates given by 
the training set. 

In 1993, responding to the community's need for benchmarking, the U.S.
National Institute of Standards and Technology (NIST) provided a database
of handwritten characters containing 60,000 training images and 10,000 test
data, where characters are described as vectors in 20 x 20 = 400 pixel space.

For this database a special neural network (LeNet 4) was designed. The 
following is how the article reporting the benchmark studies (Léon Bottou
et al., 1994) describes the construction of LeNet 4:

"For quite a long time, LeNet 1 was considered the state of 
the art. The local learning classifier, the SV classifier, and tan- 
gent distance classifier were developed to improve upon LeNet 
1, and they succeeded in that. However, they in turn motivated
a search for an improved neural network architecture.
This search was guided in part by estimates of the capacity of 
various learning machines, derived from measurements of the 
training and test error (on the large NIST database) as a func- 
tion of the number of training examples. 15 We discovered that 
more capacity was needed. Through a series of experiments in 
architecture, combined with an analysis of the characteristics 
of recognition errors, LeNet 4 was crafted." 

In these benchmarks, two learning machines that construct entire decision
rules:

(i) LeNet 4,

(ii) Polynomial SV machine (polynomial of degree four)

provided the same performance: 1.1% test error 16.

15 V. Vapnik, E. Levin, and Y. LeCun (1994) "Measuring the VC dimension of
a learning machine," Neural Computation, 6(5), pp. 851-876.

The local learning approach and tangent distance matching to 60,000 
templates also gave the same performance: 1.1% test error. 

Recall that for a small (U.S. Postal Service) database the best result (by
far) was obtained by the tangent distance matching method, which uses a
priori information about the problem (incorporated in the concept of tangent
distance). As the number of examples increased to 60,000, the advantage
of a priori knowledge decreased. The advantage of the local learning
approach also decreased with the increasing number of observations.

LeNet 4, crafted for the NIST database, demonstrated remarkable improvement
in performance compared to LeNet 1 (which has 1.7% test
errors for the NIST database 17).

The standard polynomial SV machine also did a good job. We continue
the quotation (Léon Bottou et al., 1994):

"The SV machine has excellent accuracy, which is most remarkable,
because unlike the other high performance classifiers it
does not include knowledge about the geometry of the problem.
In fact this classifier would do just as well if the image pixels
were encrypted, e.g., by a fixed random permutation."
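The remark in the quotation can be checked directly: a kernel that depends on the inputs only through their inner product gives the same value when the same fixed permutation is applied to the pixels of every image. The sketch below uses small random integer "images" and a degree-four polynomial kernel; the image size, the random seed, and the kernel form with the added constant are assumptions made for illustration.

```python
import random

def poly_kernel(x, y, d):
    """Polynomial kernel built from the inner product of two pixel vectors."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

rng = random.Random(0)
n_pixels = 16
x = [rng.randint(0, 255) for _ in range(n_pixels)]
y = [rng.randint(0, 255) for _ in range(n_pixels)]

# One fixed permutation, applied identically to every image.
perm = list(range(n_pixels))
rng.shuffle(perm)

def scramble(image):
    return [image[p] for p in perm]

same = poly_kernel(x, y, 4) == poly_kernel(scramble(x), scramble(y), 4)
print(same)
```

Because the optimization problem sees the training data only through such kernel values, the resulting SV machine is unchanged by the scrambling, which is why no geometric knowledge is being used.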

However, the performance achieved by these learning machines is not 
the record for the NIST database. Using models of characters (the same 
that was used for constructing the tangent distance) and 60,000 examples 
of training data, H. Drucker, R. Schapire, and P. Simard generated more 
than 1,000,000 examples which they used to train three LeNet 4 neural 
networks, combined in the special "boosting scheme" (Drucker, Schapire,
and Simard, 1993), which achieved a 0.7% error rate.

Now the SV machines have a challenge: to cover this gap (between
1.1% and 0.7%). Probably the use of only brute force SV machines and
60,000 training examples will not be sufficient to cover the gap. Probably
one has to incorporate some a priori information about the problem at
hand.

16 Unfortunately one cannot compare these results to the results described in
Chapter 5. The digits from the NIST database are "easier" for recognition than
the ones from the U.S. Postal Service database.

17 Note that LeNet 4 has an advantage for the large (60,000 training examples)
NIST database. For a small (U.S. Postal Service) database containing 7,000
training examples, the network with smaller capacity, LeNet 1, is better.

There are several ways to do this. The simplest one is to use the same
1,000,000 examples (constructed from the 60,000 NIST prototypes). However,
it is more interesting to find a way of directly incorporating the
invariants that were used for generating the new examples. For example,
for polynomial machines one can incorporate a priori information about
invariance by using the convolution of an inner product in the form $(x^T A x^*)^d$,
where $x$ and $x^*$ are input vectors and $A$ is a symmetric positive definite
matrix reflecting the invariants of the models. 18
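A minimal sketch of the convolution (x^T A x*)^d follows. The matrix used here is a hypothetical symmetric positive definite matrix chosen only to show the mechanics; how to construct a matrix that actually reflects the invariants of a model is not specified in the text and is not attempted here.

```python
def invariance_kernel(x, x_star, A, d):
    """(x^T A x*)^d for a symmetric positive definite matrix A."""
    Ax_star = [sum(a_ij * s for a_ij, s in zip(row, x_star)) for row in A]
    return sum(xi * v for xi, v in zip(x, Ax_star)) ** d

identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the plain inner product
coupling = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical SPD matrix coupling the two inputs

plain = invariance_kernel([1.0, 2.0], [3.0, 1.0], identity, 2)
coupled = invariance_kernel([1.0, 2.0], [3.0, 1.0], coupling, 2)
print(plain, coupled)
```

With the identity matrix the kernel reduces to the ordinary inner product raised to the power d, so the matrix is the only place where prior knowledge enters.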

One can also incorporate another (geometrical) type of a priori information
using only features (monomials) $x_i x_j x_k$ formed by pixels that are
close to each other (this reflects our understanding of the geometry of the
problem: important features are formed by pixels that are connected to
each other, rather than pixels far from each other). This essentially reduces
(by a factor of millions) the dimensionality of feature space.

Thus, although the theoretical foundations of Support Vector machines
look more solid than those of Neural Networks, the practical advantages of
the new type of learning machines still need to be proved. 18a

5.11 WISDOM OF STATISTICAL MODELS

In this chapter we introduced the Support Vector machines which realize 
the Structural Risk Minimization inductive principle by 

(i) Mapping the input vector into a high-dimensional feature space using 
a nonlinear transformation. 

(ii) Constructing in this space a structure on the set of linear decision 
rules according to the increasing norm of weights of canonical hyper- 
planes. 

(iii) Choosing the best element of the structure and the best function 
within this element in order to minimize the bound on error proba- 
bility. 

The implementation of this scheme in the algorithms described in this 
chapter, however, contained one violation of the SRM principle. To define 
the structure on the set of linear functions we use the set of canonical 
hyperplanes constructed with respect to vectors x from the training data. 



l8 B. SchSlkopf considered an intennediate way: he constructed an SV machine, 
generated new examples by transforming the SV images (translating them in the 
four principal directions), and retrained on the support vectors and the new 
examples. For the U.S. Postal Service database, this improves the performance 
from 4.0% to 3.2%. 

18a In connection with heuristics incorporated in Neural Networks let me recall
the following remark by R. Feynman: "We must make it clear from the beginning
that if a thing is not a science, it is not necessarily bad. For example, love is not
science. So, if something is said not to be a science it does not mean that there
is something wrong with it; it just means that it is not a science." The Feynman
Lectures on Physics, Addison-Wesley, 3-1, 1975.
According to the SRM principle, the structure has to be defined a priori
before the training data appear. 

The attempt to implement the SRM principle in toto brings us to a new 
statement of the learning problem which forms a new type of inference. For 
simplicity we consider this model for the pattern recognition problem. 

Let the learning machine that implements a set of functions linear in
feature space be given $\ell + k$ vectors

$$x_1, \ldots, x_{\ell+k} \qquad (5.49)$$

drawn randomly and independently according to some distribution function.

Suppose now that these $\ell + k$ vectors are randomly divided into two
subsets: the subset

$$x_1, \ldots, x_\ell,$$

for which the string

$$y_1, \ldots, y_\ell$$

describing classification of these vectors is given (the training set), and the
subset

$$x_{\ell+1}, \ldots, x_{\ell+k},$$

for which the classification string should be found by the machine (the test
set). The goal of the machine is to find the rule that gives the string with
the minimal number of errors on the given test set.

In contrast to the model of function estimation considered in this book, 
this model looks for the rule that minimizes the number of errors on the 
given test set rather than for the rule minimizing the probability of error 
on the admissible test set. We call this problem the estimation of the values 
of the function at given points. For the problem of estimating the values 
of a function at given points the SV machines will realize the SRM principle
in toto if one defines the canonical hyperplanes with respect to all $\ell + k$
vectors (5.49). (One can consider the data (5.49) as a priori information. 
A posteriori information is any information about separating this set into 
two subsets.) 

Estimating the values of a function at given points has both a solution
and a method of solution that differ from those based on estimating an
unknown function.

Consider for example the five-digit zip-code recognition problem. 19 The
existing technology based on estimating functions suggests recognizing the
five digits $x_1, \ldots, x_5$ of the zip-code independently: first one uses the
rules constructed during the learning procedures to recognize digit $x_1$, then
one uses the same rules to recognize digit $x_2$, and so on.



19 For simplicity we do not consider the segmentation problem. We suppose
that all five digits of a zip-code are segmented. 
The technology of estimating the values of a function suggests recognizing
all five digits jointly: the recognition of one digit, say $x_1$, depends
not only on the training data and vector $x_1$, but also on vectors
$x_2, \ldots, x_5$. In this technology one uses the rules that are in a special
way adapted to solving a given specific task. One can prove that this
technology gives more accurate solutions. 20

It should be noted that this new view of the learning problem was first
found through attempts to justify a structure defined on the set of
canonical hyperplanes for the SRM principle.



5.12 WHAT CAN ONE LEARN FROM DIGIT 
RECOGNITION EXPERIMENTS? 

Three observations should be discussed in connection with the experiments 
described in this chapter: 

(i) The structure constructed in the feature space reflects real-life prob- 
lems well. 

(ii) The quality of decision rules obtained does not strongly depend on 
the type of SV machine (polynomial machine, RBF machine, two- 
layer NN). It does, however, strongly depend on the accuracy of the 
VC dimension (capacity) control. 

(iii) Different types of machines use the same elements of training data as
support vectors. 



5.12.1 Influence of the Type of Structures and Accuracy of 
Capacity Control 

The classical approach to estimating multidimensional functional dependencies
is based on the following belief:

Real-life problems are such that there exists a small number of "strong
features," simple functions of which (say linear combinations) approximate
well the unknown function. Therefore, it is necessary to carefully choose a 
low-dimensional feature space and then to use regular statistical techniques 
to construct an approximation. 



20 Note that the local learning approach described in Section 4.5 can be consid- 
ered as an intermediate model between function estimation and estimation of the 
values of a function at points of interest. Recall that for a small (Postal Service) 
database the local learning approach gave significantly better results (3.3% error 
rate) than the best result based on the entire function estimation approach
(5.1% obtained by LeNet 1, and 4.0% obtained by the polynomial SV machine).
This approach stresses: be careful at the stage of feature selection (this
is an informal operation) and then use routine statistical techniques.

The new technique is based on a different belief: 

Real-life problems are such that there exist a large number of "weak
features" whose "smart" linear combination approximates the unknown
dependency well. Therefore, it is not very important what kind of "weak feature"
one uses; it is more important to form "smart" linear combinations.

This approach stresses: choose any reasonable "weak feature space" (this 
is an informal operation), but be careful at the point of making "smart" 
linear combinations. From the perspective of SV machines, "smart" linear
combinations correspond to the capacity control method.

This belief in the structure of real-life problems has been expressed many 
times both by theoreticians and by experimenters. 

In 1940, Church made a claim that is known as the Turing-Church 
Thesis 21 : 

All (sufficiently complex) computers compute the same family of func- 
tions. 

In our specific case we discuss the even stronger belief that linear functions
in various feature spaces, associated with different convolutions of the
inner product, approximate the same set of functions if they possess the
same capacity.

Church made his claim on the basis of pure theoretical analysis. However,
as soon as computer experiments became widespread, researchers unexpectedly
faced a situation that could be described in the spirit of Church's
claim.

In the 1970s and in the 1980s a considerable amount of experimental 
research was conducted in solving various operator equations that formed 
ill-posed problems, in particular in density estimation. A common observation
was that the choice of the type of regularizer $\Omega(f)$ in (4.32)
(determining a type of structure) is not as important as choosing the correct
regularization constant $\gamma(\delta)$ (determining capacity control).

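In particular, in density estimation using the Parzen window

$$p(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \frac{1}{\gamma^n} K\left(\frac{x - x_i}{\gamma}\right), \quad x \in R^n,$$

a common observation was: if the number of observations is not "very
small," the type of kernel function $K(u)$ in the estimator is not as important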



21 Note that the thesis does not reflect some proved fact. It reflects the belief 
in the existence of some law that is hard to prove (or formulate in exact terms). 
as the value of the constant $\gamma$. (Recall that the kernel $K(u)$ in Parzen's
estimator is determined by the functional $\Omega(f)$, and $\gamma$ is determined
by the regularization constant.)
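
The Parzen window observation can be tried out on a small scale. The sketch
below is a minimal one-dimensional illustration, not code from the experiments
referred to in the text: the two kernels, the toy sample, and the helper names
(`parzen_density`, `integrate`) are assumptions chosen for the example. It
evaluates the estimate for two kernel types and two values of the width
parameter `gamma`, so that the effect of each choice can be compared.

```python
import math

def gaussian_kernel(u):
    """Standard normal density; integrates to 1 over the real line."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(u):
    """Parabolic kernel supported on [-1, 1]; also integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def parzen_density(x, sample, gamma, kernel):
    """1-D Parzen estimate: (1 / (l * gamma)) * sum_i K((x - x_i) / gamma)."""
    total = sum(kernel((x - xi) / gamma) for xi in sample)
    return total / (len(sample) * gamma)

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

# Toy sample, invented for illustration only.
sample = [-1.2, -0.7, -0.3, 0.0, 0.1, 0.4, 0.9, 1.5]

# Compare the estimate at x = 0 across kernel types and widths.
for gamma in (0.2, 1.0):
    for kernel in (gaussian_kernel, epanechnikov_kernel):
        p0 = parzen_density(0.0, sample, gamma, kernel)
        print(f"gamma={gamma:<4} {kernel.__name__:<20} p(0)={p0:.4f}")
```

Note that the same `gamma` does not give the same effective width for both
kernels (the Gaussian has unit standard deviation, the parabolic kernel about
0.45), so a fair comparison of kernel types tunes `gamma` separately for each.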

The same was observed in the regression estimation problem where one 
tries to use expansions in different series to estimate the regression function: 
if the number of observations is not "very small" the type of series used is 
not as important as the number of terms in the approximation. All these
observations were made in solving low-dimensional (mostly one-dimensional)
problems.

In the described experiments we observed the same phenomena in very 
high-dimensional space. 

5.12.2 SRM Principle and the Problem of Feature 
Construction 

The "smart" linear combination of the large number of features used in 
the SV machine has an important common structure: the set of support 
vectors. We can describe this structure as follows: along with the set of 
weak features (weak feature space) there exists a set of complex features 
associated with support vectors. Let us denote this space 

$$u = (K(x, x_1), \ldots, K(x, x_N)) \in U,$$

where $x_1, \ldots, x_N$ are the support vectors. In the space of complex
features $U$, we constructed
a linear decision rule. Note that in the bound obtained in Theorem 5.2 
the expectation of the number of complex features plays the role of the 
dimensionality of the problem. Therefore one can describe the difference 
between the support vector approach and the classical approach in the 
following way: 

To perform the classical approach well requires the human selection
(construction) of a relatively small number of "smart features," while the
support vector approach selects (constructs) a small number of "smart
features" automatically.

Note that the SV machines construct the Optimal hyperplane in the
space Z (space of weak features) but not in the space of complex features.
It is easy, however, to find the coefficients that provide optimality for the
hyperplane in the space U (after the complex features are chosen). Moreover,
one can construct in the U space a new SV machine (using the same training
data). Therefore one can construct a two-layer (or several-layer) SV machine.
In other words, one can suggest multistage selection of "smart features." As
we remarked in Section 4.10, the problem of feature selection is, however,
quite delicate (recall the difference between constructing sparse algebraic
polynomials and sparse trigonometric polynomials).
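
The construction of the complex feature space can be written out directly.
The following is a minimal sketch under stated assumptions: the RBF kernel,
its width, the two support vectors, and the coefficients are hypothetical
values chosen for illustration (in a trained SV machine the coefficients would
come from the optimization procedure). It maps an input x to the vector of
kernel values taken with respect to the support vectors and applies a linear
decision rule in that space.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * |x - y|^2); gamma is an illustrative choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def complex_features(x, support_vectors, kernel=rbf_kernel):
    """Map x to u = (K(x, x_1), ..., K(x, x_N)), one coordinate per support vector."""
    return [kernel(x, sv) for sv in support_vectors]

def decide(x, support_vectors, weights, bias, kernel=rbf_kernel):
    """Linear decision rule in the space U: sign(<w, u> + b)."""
    u = complex_features(x, support_vectors, kernel)
    score = sum(w * ui for w, ui in zip(weights, u)) + bias
    return 1 if score >= 0 else -1

# Hypothetical support vectors and coefficients (not taken from any experiment).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
weights = [1.0, -1.0]
bias = 0.0

print(complex_features((0.0, 0.0), support_vectors))
print(decide((0.1, 0.0), support_vectors, weights, bias))  # near the first vector
print(decide((1.9, 2.1), support_vectors, weights, bias))  # near the second vector
```

The dimensionality of the vector returned by `complex_features` equals the
number of support vectors, whatever the dimensionality of the original input.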

5.12.3 Is the Set of Support Vectors a Robust Characteristic of 
the Data?

In our experiments we observed an important phenomenon: different types
of SV machines optimal in parameters use almost the same support vectors.
There exists a small subset of the training data (in our experiments less than
3% to 5% of data) that for the problem of constructing the best decision rule
is equivalent to the complete set of training data, and this subset of the
training data is almost the same for different types of optimal SV machines
(polynomial machine with the best degree of polynomials, RBF machine
with the best parameter $\gamma$, and NN machine with the best parameter $b$).

The important question is whether this is true for a wide set of real- 
life problems. There exists indirect theoretical evidence that this is quite 
possible. One can show that if a majority vote scheme, based on various 
support vector machines, does not improve performance, then the percent- 
age of common support vectors of these machines must be high. 
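
One simple way to quantify "almost the same support vectors" is to compare
the index sets of the training examples that each machine retains. The sketch
below assumes that every machine reports its support vectors as indices into
a shared training set; the machine names and index sets are invented for
illustration and are not the experimental results discussed above.

```python
def common_fraction(sv_a, sv_b):
    """Share of machine A's support vectors that machine B also uses."""
    a, b = set(sv_a), set(sv_b)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical indices of training examples chosen as support vectors.
machines = {
    "polynomial": {3, 8, 15, 21, 42, 57},
    "rbf": {3, 8, 15, 21, 42, 60},
    "two-layer nn": {3, 8, 15, 21, 57, 61},
}

# Pairwise comparison of the machines.
names = list(machines)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        share = common_fraction(machines[first], machines[second])
        print(f"{first} vs {second}: {share:.0%} common")

# Examples retained by every machine.
shared_by_all = set.intersection(*machines.values())
print("used by every machine:", sorted(shared_by_all))
```

The measure is asymmetric when the machines keep different numbers of support
vectors, so it may be worth reporting it in both directions.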

It is too early to discuss the properties of SV machines: the analysis of
these properties has only just started. 22 Therefore I would like to finish
these comments with the following remark.



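22 C. Burges demonstrated that one can
approximate the obtained decision rule

$$f(x) = \mathrm{sign}\left\{ \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \right\}$$

by the much simpler decision rules

$$f^*(x) = \mathrm{sign}\left\{ \sum_{i=1}^{M} \beta_i K(x, T_i) + b \right\}, \quad M \ll N,$$

using the so-called generalized support vectors $T_1, \ldots, T_M$ (a specially
constructed set of vectors).

To obtain approximately the same performance for the digit recognition problem
described in Section 5.7, it was sufficient to use an approximation based on
$M = 11$ generalized support vectors per classifier instead of $N = 270$
(initially obtained) support vectors per classifier.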

This means that for Support Vector machines there exists a regular way to 
synthesize the decision rules possessing the optimal complexity. 
The SV machine is a very suitable object for theoretical analysis. It
unifies various conceptual models: 

(i) The SRM model. (That is how the SV machine initially was obtained. 
Theorem 5.1.) 

(ii) The Data Compression model. (The bound in Theorem 5.2 can be 
described in terms of the compression coefficient.) 

(iii) A universal model for constructing complex features. (The convolution
of the inner product in Hilbert space can be considered as a
standard way for feature construction.)

(iv) A model of real-life data. (A small set of support vectors might be
sufficient to characterize the whole training set for different machines.)

In a few years it will be clear if such unification of models reflects some 
intrinsic properties of learning mechanisms, or if it is the next cul-de-sac.